AITopics | log loss

This work introduces a general framework for calibeating based on regret minimization. As compared to Foster and Hart's seminal calibeating work which had specialized treatments of Brier score (squared loss) and log loss, we consider a large family of proper losses that includes $α$-Tsallis losses (for $α\in [1, 2]$) and Lipschitz losses. Our results for Tsallis losses also hold for an unscaled version of Tsallis loss that recovers log loss. Our analysis is oriented around the Bregman divergence view of a proper loss. Technically, our results for the family of Tsallis losses that we consider are U-calibration results, simultaneously obtaining logarithmic regret for all losses in this family while having a weaker dependence on the dimension compared to previous results. Of potential independent interest, we also show a new regret equality for the regret of Be The Regularized Leader. This regret equality holds for general proper losses and itself is based on two results related to online updating formulas for the generalized variance, the latter being a previously introduced generalization of variance based on Bregman divergences.

artificial intelligence, machine learning, proper loss, (16 more...)

arXiv.org Machine Learning

2605.17269

Genre: Research Report (0.90)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

07fbde96bee50f4e09303fd4f877c2f3-Paper-Conference.pdf

Neural Information Processing SystemsMay-1-2026, 01:40:14 GMT

artificial intelligence, machine learning, prediction, (20 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario (0.28)

Genre: Research Report > New Finding (0.46)

Add feedback

Toward a Characterization of Loss Functions for Distribution Learning

Nika Haghtalab, Cameron Musco, Bo Waggoner

Neural Information Processing SystemsFeb-12-2026, 10:46:46 GMT

Neural Information Processing Systems http://nips.cc/

log loss, loss function, properness, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Colorado (0.04)
North America > Canada (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Add feedback

44ecfb60950e868a13172b935b7964a9-Paper-Conference.pdf

Neural Information Processing SystemsFeb-11-2026, 04:09:21 GMT

calibration property, loss function, small gradient assumption, (14 more...)

Neural Information Processing Systems

Country:

Europe > Norway (0.04)
Europe > France (0.04)
Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

7535bbb91c8fde347ad861f293126633-Supplemental.pdf

Neural Information Processing SystemsFeb-9-2026, 09:36:03 GMT

compressor, invariant, maximal invariant, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Long Beach (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(16 more...)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(4 more...)

Add feedback

07fbde96bee50f4e09303fd4f877c2f3-Paper-Conference.pdf

Neural Information Processing SystemsFeb-7-2026, 13:45:58 GMT

epinet, joint prediction, prediction, (17 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Michigan (0.04)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)

Genre: Research Report > New Finding (0.46)

Add feedback

Toward a Characterization of Loss Functions for Distribution Learning

Neural Information Processing SystemsDec-25-2025, 12:07:19 GMT

In this work we study loss functions for learning and evaluating probability distributions over large discrete domains. Unlike classification or regression where a wide variety of loss functions are used, in the distribution learning and density estimation literature, very few losses outside the dominant \emph{log loss} are applied. We aim to understand this fact, taking an axiomatic approach to the design of loss functions for distributions. We start by proposing a set of desirable criteria that any good loss function should satisfy. Intuitively, these criteria require that the loss function faithfully evaluates a candidate distribution, both in expectation and when estimated on a few samples.

characterization, loss function, name change, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.39)

Add feedback

Excess Risk Bounds for the Bayes Risk using Variational Inference in Latent Gaussian Models

Rishit Sheth, Roni Khardon

Neural Information Processing SystemsNov-21-2025, 11:22:37 GMT

We strengthen previous results for variational algorithms by showing that they are competitive with any point-estimate predictor. Unlike previous work, we provide bounds on the risk of the Bayesian predictor and not just the risk of the Gibbs predictor for the same approximate posterior.

algorithm, artificial intelligence, machine learning, (19 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Massachusetts > Middlesex County > Medford (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(2 more...)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.96)

Add feedback

On the Entropy Calibration of Language Models

Cao, Steven, Valiant, Gregory, Liang, Percy

arXiv.org Machine LearningNov-18-2025

We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution -- in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or, lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.

large language model, machine learning, natural language, (21 more...)

arXiv.org Machine Learning

2511.11966

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(13 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Media (0.67)
Leisure & Entertainment (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)

Add feedback

Context-Aware Inference via Performance Forecasting in Decentralized Learning Networks

Pfeffer, Joel, Kruijssen, J. M. Diederik, Gossart, Clément, Chevance, Mélanie, Millan, Diego Campo, Stecker, Florian, Longmore, Steven N.

arXiv.org Artificial IntelligenceOct-9-2025

In decentralized learning networks, predictions from many participants are combined to generate a network inference. While many studies have demonstrated performance benefits of combining multiple model predictions, existing strategies using linear pooling methods (ranging from simple averaging to dynamic weight updates) face a key limitation. Dynamic prediction combinations that rely on historical performance to update weights are necessarily reactive. Due to the need to average over a reasonable number of epochs (with moving averages or exponential weighting), they tend to be slow to adjust to changing circumstances (phase or regime changes). In this work, we develop a model that uses machine learning to forecast the performance of predictions by models at each epoch in a time series. This enables `context-awareness' by assigning higher weight to models that are likely to be more accurate at a given time. We show that adding a performance forecasting worker in a decentralized learning network, following a design similar to the Allora network, can improve the accuracy of network inferences. Specifically, we find forecasting models that predict regret (performance relative to the network inference) or regret z-score (performance relative to other workers) show greater improvement than models predicting losses, which often do not outperform the naive network inference (historically weighted average of all inferences). Through a series of optimization tests, we show that the performance of the forecasting model can be sensitive to choices in the feature set and number of training epochs. These properties may depend on the exact problem and should be tailored to each domain. Although initially designed for a decentralized learning network, using performance forecasting for prediction combination may be useful in any situation where predictive rather than reactive model weighting is needed.

artificial intelligence, inference, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.70235/allora.0x20040

2510.06444

Genre: Research Report > New Finding (0.74)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback